Generating Complementary Acoustic Model Spaces in DNN-Based Sequence-to-Frame DTW Scheme for Out-of-Vocabulary Spoken Term Detection
Authors
Abstract
This paper proposes a sequence-to-frame dynamic time warping (DTW) combination approach to improve out-of-vocabulary (OOV) spoken term detection (STD) performance. The goal of this paper is twofold. First, we propose a method that directly adopts the posterior probabilities of a deep neural network (DNN) and a Gaussian mixture model (GMM) as the similarity distance for sequence-to-frame DTW. Second, we investigate combinations of diverse GMM and DNN schemes with different subword units and acoustic models, estimate their complementarity in terms of the performance gap and correlation of the combined systems, and discuss the resulting performance gain. Evaluations of the combined systems on an OOV STD task show that the performance gain of DNN-based systems is better than that of GMM-based systems. However, the performance gain obtained by combining DNN- and GMM-based systems is insignificant, even though DNN and GMM are highly heterogeneous. This is because the performance gap between DNN-based systems and GMM-based systems is quite large. On the other hand, score fusion of two heterogeneous subword units, triphone and sub-phonetic segments, in DNN-based systems provides significantly improved performance.
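As a rough illustration of the idea described in the abstract, the sketch below aligns a query, given as a sequence of subword state indices, against per-frame posteriors, using the negative log posterior as the local distance. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `seq_to_frame_dtw` and `fuse_scores`, the step pattern (stay in the same state or advance by one state per frame), the lack of length normalization, and the linear fusion weight are all hypothetical choices made for this example.

```python
import numpy as np


def seq_to_frame_dtw(query_states, log_post):
    """Subsequence DTW of a state sequence against frame-level log posteriors.

    query_states: list of M state indices (hypothetical subword states).
    log_post:     array of shape (T, S), log posterior of each state per frame.
    Returns (best accumulated cost, end frame of the best match).
    """
    # Local distance (assumption): -log p(state | frame), shape (M, T).
    dist = -log_post[:, query_states].T
    M, T = dist.shape

    acc = np.full((M, T), np.inf)
    # A match may start at any frame, so the first state carries no history.
    acc[0, :] = dist[0, :]

    for i in range(1, M):
        for t in range(1, T):
            # Either stay in state i, or advance from state i-1.
            acc[i, t] = dist[i, t] + min(acc[i, t - 1], acc[i - 1, t - 1])

    # The match may end at any frame; pick the cheapest end point.
    end = int(np.argmin(acc[-1]))
    return float(acc[-1, end]), end


def fuse_scores(score_a, score_b, weight=0.5):
    """Linear score fusion of two systems (the weight is an assumption)."""
    return weight * score_a + (1.0 - weight) * score_b


if __name__ == "__main__":
    # Toy posteriors: 6 frames, 3 states, dominant state per frame given below.
    post = np.full((6, 3), 0.05)
    for t, s in enumerate([2, 2, 0, 1, 2, 2]):
        post[t, s] = 0.9
    cost, end = seq_to_frame_dtw([0, 1, 2], np.log(post))
    print(cost, end)
```

In a combined system, a cost such as the one returned above could be computed once per subword unit type (for example triphone and sub-phonetic segment models) and the two costs merged with `fuse_scores`; how the weight is chosen is not stated in the abstract.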
Similar resources
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive is one of the key issues for this broadcasting...
Utilizing state-level distance vector representation for improved spoken term detection by text and spoken queries
In spoken term detection (STD) systems, approximate subword-level matching of the query term and automatically transcribed spoken documents is often employed for its reasonable accuracy and efficiency. However, a high out-of-vocabulary (OOV) rate often degrades the subword-level recognition accuracy and affects the STD performance. This paper describes the usage of new expanded acoustic representations...
Combining State-Level Spotting and Posterior-Based Acoustic Match for Improved Query-by-Example Spoken Term Detection
In spoken term detection (STD) systems, an automatic speech recognition (ASR) frontend is often employed for its reasonable accuracy and efficiency. However, the out-of-vocabulary (OOV) problem at the ASR stage has a great impact on the STD performance for spoken queries. In this paper, we propose combining feature-based acoustic match which is often employed in the STD systems for low resource languages, a...
Query-by-example spoken term detection based on phonetic posteriorgram
Spoken term detection in low-resource situations is a challenging problem, because traditional large vocabulary continuous speech recognition (LVCSR) approaches are often unusable. This paper introduces a method to use deep neural network (DNN) softmax outputs as input features in a query-by-example (QBE) spoken term detection (STD) system. Matches between queries and test utterances are locate...
Unsupervised spoken-term detection with spoken queries using segment-based dynamic time warping
Spoken term detection is important for retrieval of multimedia and spoken content over the Internet. Because it is difficult to have acoustic/language models well matched to the huge quantities of spoken documents produced under various conditions, unsupervised approaches using frame-based dynamic time warping (DTW) have been proposed to compare the spoken query with spoken documents frame by fr...